In assessing the categorical features, it appears that having diabetes and being a smoker are associated with an increased risk of intubation or death. The use of immunosuppressants also seems to be associated with increased risk, but very few patients appear to be taking an immunosuppressant. The continuous features also seem to be distributed differently between the two groups. Patients who experience an event are considerably younger than patients who do not (mean age 58.6 vs. 71.8 years). The duration of symptoms is somewhat longer on average for patients without an event (mean 9.5 vs. 8.2, although the medians are both 9). Finally, BMI is higher on average for patients without an event (mean 29.5 vs. 26.0).
## Continuous variables for patients with event:
| | age | bmi | duration_symptoms |
|---|---|---|---|
| Min. | 5.408467 | 9.861328 | 1.000000 |
| 1st Qu. | 46.445732 | 22.435642 | 5.000000 |
| Median | 59.334673 | 25.998552 | 9.000000 |
| Mean | 58.638100 | 26.002439 | 8.212403 |
| 3rd Qu. | 70.614480 | 29.197509 | 10.000000 |
| Max. | 95.083451 | 50.830828 | 27.000000 |
## Continuous variables for patients without event:
| | age | bmi | duration_symptoms |
|---|---|---|---|
| Min. | 16.08475 | 15.89455 | 1.000000 |
| 1st Qu. | 62.04439 | 24.93572 | 6.000000 |
| Median | 73.54337 | 28.51278 | 9.000000 |
| Mean | 71.78282 | 29.52817 | 9.530758 |
| 3rd Qu. | 82.09736 | 32.64023 | 12.000000 |
| Max. | 113.67434 | 58.90469 | 35.000000 |
To clean the data, I calculated the timepoint in hours relative to the initial reading for all labs and for all subjects. I also linearly imputed the missing data using the imputeTS package. Originally, I intended to use the ARIMA imputation method given the autoregressive nature of the data, but this resulted in physiologically impossible imputed values given the nature of the repeated NA values.
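The two cleaning steps can be illustrated with a minimal Python sketch. It is not the original R code: the function names and sample readings are hypothetical, and it interpolates over elapsed time, whereas the report does not say whether imputeTS was applied over time or over reading order. It also shows why linear imputation avoids impossible values: each filled value lies between its two observed neighbours.

```python
from datetime import datetime


def hours_since_first(timestamps):
    """Convert timestamps to hours elapsed since the earliest reading."""
    first = min(timestamps)
    return [(t - first).total_seconds() / 3600.0 for t in timestamps]


def interpolate_linear(times, values):
    """Fill None entries by linear interpolation over time.

    `times` must be sorted ascending. Gaps at the start or end are filled
    with the nearest observed value (an assumption of this sketch).
    """
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    filled = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(v)
            continue
        before = [(kt, kv) for kt, kv in known if kt < t]
        after = [(kt, kv) for kt, kv in known if kt > t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            filled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        else:
            filled.append(before[-1][1] if before else after[0][1])
    return filled


# Hypothetical SpO2 readings for one subject, with two missing values.
stamps = [datetime(2020, 4, 1, hour, 0) for hour in (8, 9, 10, 11)]
spo2 = [96.0, None, None, 90.0]
hours = hours_since_first(stamps)
print(interpolate_linear(hours, spo2))  # [96.0, 94.0, 92.0, 90.0]
```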
To engineer the features, I first selected summary statistics of the lab values to add to the features, including the mean, median, minimum, and maximum values. Initially, I had intended to incorporate some trend features. However, the measures demonstrated autoregressive properties, appearing to be cyclical over the course of the day but more or less stationary, so trend features were unlikely to be informative. I also added measures of the distribution characteristics, including skew and kurtosis, for each variable.
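As a rough illustration, the per-subject summary features could be computed as in the Python sketch below. The function name is hypothetical, and the moment-based (population) definitions of skew and kurtosis are an assumption: the report does not say which definitions were used, and some packages report excess kurtosis (this value minus 3) instead.

```python
import statistics


def summarize(readings):
    """Summary features for one lab measure of one subject."""
    n = len(readings)
    mean = statistics.fmean(readings)
    # Population central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in readings) / n
    m3 = sum((x - mean) ** 3 for x in readings) / n
    m4 = sum((x - mean) ** 4 for x in readings) / n
    return {
        "mean": mean,
        "median": statistics.median(readings),
        "min": min(readings),
        "max": max(readings),
        # Guard against constant series, where the variance is zero.
        "skew": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
    }


# Hypothetical heart-rate readings for one subject.
print(summarize([72, 75, 71, 90, 74]))
```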
Finally, I compared the lab values that were provided to published literature to see if any are recognized as strong predictors of intubation in COVID-19 patients. One publication cited oxygen saturation (SpO2) < 90% and respiratory rate > 24 breaths/min as key features of the presentation of patients who required intubation [1]. The WHO guidelines classify severe pneumonia in COVID-19 patients as SpO2 < 93%, respiratory rate > 30 breaths/min, or severe respiratory distress [2]. As such, I created features that represented the percentage of readings where the respiratory rate was greater than 24, 26, 28, 30, 32, and 34 breaths/min and where the oxygen saturation was less than 85%, 87%, 90%, and 93%.
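A minimal Python sketch of these threshold features follows. The function name is hypothetical. The cutoffs and the `resp_*` / `spo2_*` names are taken from the feature names in the importance tables later in the report, and the sample readings are invented.

```python
def threshold_features(resp_rate, spo2):
    """Percentage of readings beyond each clinical cutoff for one subject."""
    features = {}
    for cut in (24, 26, 28, 30, 32, 34):
        above = sum(r > cut for r in resp_rate)
        features[f"resp_{cut}"] = 100.0 * above / len(resp_rate)
    for cut in (85, 87, 90, 93):
        below = sum(s < cut for s in spo2)
        features[f"spo2_{cut}"] = 100.0 * below / len(spo2)
    return features


# Hypothetical readings for one subject.
feats = threshold_features(resp_rate=[22, 25, 29, 33], spo2=[95, 92, 89, 84])
print(feats["resp_24"], feats["spo2_90"])  # 75.0 50.0
```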
| | Event | Lab Measure | Minimum | Mean | Median | Maximum |
|---|---|---|---|---|---|---|
| 1 | No | diastolic | 48.47460 | 60.15270 | 60.15916 | 69.58681 |
| 6 | Yes | diastolic | 49.45293 | 59.78746 | 59.79267 | 69.92931 |
| 2 | No | heart_rate | 65.11754 | 75.28557 | 75.26942 | 84.53421 |
| 7 | Yes | heart_rate | 65.29784 | 74.65946 | 74.62560 | 85.26364 |
| 3 | No | resp_rate | 19.83511 | 30.01062 | 30.02795 | 39.95080 |
| 8 | Yes | resp_rate | 20.65072 | 29.95777 | 29.93250 | 40.62065 |
| 4 | No | spo2 | 83.26979 | 92.44655 | 92.40854 | 103.20654 |
| 9 | Yes | spo2 | 82.40515 | 92.51879 | 92.51551 | 102.09935 |
| 5 | No | systolic | 120.17357 | 130.10846 | 130.12741 | 139.71925 |
| 10 | Yes | systolic | 119.22616 | 129.86097 | 129.84965 | 139.43613 |
Given the number and diversity of types of features, I first tried tree-based methods optimized for performance with gradient boosting and AdaBoost. Tree-based models can partition the feature space in a non-parametric manner. Gradient boosting and AdaBoost take an otherwise weak learner, a decision tree, and improve performance by fitting each new tree so that it reduces the error left by the previous trees. Additionally, while these are both “black box” methods, they can provide relative feature importance, which makes them useful for both inference and prediction.
For both methods, I first selected the tuning parameters of interaction depth, number of trees, and shrinkage using a grid search with 5-fold cross-validation. Then I fit the model on the training data and assessed performance predominantly on the testing error (i.e., the generalization error).
## The optimal shrinkage parameter is 0.009 out of the range of 0 - 0.01 tested.
## The optimal number of trees is 5000 out of the range of 1000 - 5000 tested.
## The optimal interaction depth is 2 out of the range of 1 - 2 tested.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 137 29
## 1 46 123
##
## Accuracy : 0.7761
## 95% CI : (0.7276, 0.8196)
## No Information Rate : 0.5463
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5526
##
## Mcnemar's Test P-Value : 0.06467
##
## Sensitivity : 0.7486
## Specificity : 0.8092
## Pos Pred Value : 0.8253
## Neg Pred Value : 0.7278
## Prevalence : 0.5463
## Detection Rate : 0.4090
## Detection Prevalence : 0.4955
## Balanced Accuracy : 0.7789
##
## 'Positive' Class : 0
##
## The optimal shrinkage parameter is 0.009 out of the range of 0 - 0.01 tested.
## The optimal number of trees is 5000 out of the range of 1000 - 5000 tested.
## The optimal interaction depth is 2 out of the range of 1 - 2 tested.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 135 31
## 1 48 121
##
## Accuracy : 0.7642
## 95% CI : (0.715, 0.8086)
## No Information Rate : 0.5463
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5287
##
## Mcnemar's Test P-Value : 0.07184
##
## Sensitivity : 0.7377
## Specificity : 0.7961
## Pos Pred Value : 0.8133
## Neg Pred Value : 0.7160
## Prevalence : 0.5463
## Detection Rate : 0.4030
## Detection Prevalence : 0.4955
## Balanced Accuracy : 0.7669
##
## 'Positive' Class : 0
##
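The reported statistics follow directly from the four cells of each confusion matrix. The Python sketch below (the function name is hypothetical) reproduces them for the first matrix above, treating class 0 as the positive class, as the output does.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Summary statistics from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# First matrix: rows are predictions, columns are the reference labels.
metrics = confusion_metrics(tp=137, fp=29, fn=46, tn=123)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy, for example, is (137 + 123) / 335 = 0.7761, matching the output.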
Both gradient boosting and AdaBoost allow for assessment of feature importance. The order of feature importance was roughly similar for the two methods. Overall, age and BMI were highly important, followed by various lab measures and duration of symptoms. In general, the continuous variables had a greater influence on the outcome than the categorical features. Of the categorical features, cancer, smoking status, and diabetes ranked among the highest. Of the features I engineered based on the literature, as suggested by the histograms, SpO2 < 90% and respiratory rate > 28 breaths per minute were the most important.
To compare which method was preferable, I assessed the ROC curve AUC for the training and testing data. The performance was very similar, with a test ROC AUC of approximately 0.9 for both models. Since the training ROC AUC was also quite similar, it was difficult to assess whether one model was more “overfit” than the other, that is, whether it had higher variance. Therefore, I also assessed the variance of the models by looking at the standard error of the cross-validation error from the model training output. Using this output, the AdaBoost model appeared to have less variance, and thus I would select this method as preferable.
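For reference, the ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The Python sketch below computes it that way. The function name and the scores are invented for illustration and are not the models' outputs.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up predicted probabilities on a training and a testing set.
train_auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])
test_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# A large gap between the two would suggest overfitting.
print(train_auc, test_auc, train_auc - test_auc)
```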
| | Features | Importance, GBM | Importance, AdaBoost |
|---|---|---|---|
| 1 | age | 14.6929686 | 14.3372478 |
| 3 | bmi | 11.7320259 | 11.5205599 |
| 24 | heart_rate.mean | 8.2223480 | 7.8402792 |
| 25 | heart_rate.median | 5.6659504 | 5.7081920 |
| 12 | diastolic.max | 5.3589183 | 5.6541987 |
| 13 | diastolic.mean | 3.9473109 | 3.5871579 |
| 17 | duration_symptoms | 2.6966765 | 3.0273976 |
| 14 | diastolic.median | 2.6057753 | 2.6351449 |
| 59 | systolic.median | 2.5352015 | 2.4616747 |
| 61 | systolic.skew | 2.4907757 | 2.4649863 |
| 26 | heart_rate.min | 2.2999222 | 2.3734553 |
| 57 | systolic.max | 2.2231788 | 2.2615455 |
| 38 | resp_rate.kurtosis | 2.0459262 | 2.0074059 |
| 58 | systolic.mean | 1.8950444 | 1.8855952 |
| 56 | systolic.kurtosis | 1.8031273 | 1.9575278 |
| 47 | spo2.max | 1.5831948 | 1.8680507 |
| 46 | spo2.kurtosis | 1.5452475 | 1.4677714 |
| 54 | spo2_90 | 1.5196014 | 1.4480661 |
| 50 | spo2.min | 1.4013477 | 1.6473623 |
| 60 | systolic.min | 1.3959180 | 1.5195242 |
| 11 | diastolic.kurtosis | 1.3931699 | 1.2962214 |
| 16 | diastolic.skew | 1.3906031 | 1.3003317 |
| 23 | heart_rate.max | 1.3882695 | 1.3152081 |
| 43 | resp_rate.skew | 1.3139206 | 1.2322535 |
| 15 | diastolic.min | 1.2632282 | 1.1223651 |
| 27 | heart_rate.skew | 1.2611701 | 1.1502193 |
| 42 | resp_rate.min | 1.2047795 | 1.3334680 |
| 34 | resp_28 | 1.1334608 | 1.3601187 |
| 55 | spo2_93 | 1.0353126 | 0.9139220 |
| 22 | heart_rate.kurtosis | 0.9600189 | 1.1151746 |
| 5 | cancer | 0.9393139 | 0.9896885 |
| 36 | resp_32 | 0.9343096 | 1.0028440 |
| 39 | resp_rate.max | 0.9085980 | 0.9886775 |
| 45 | smoke_vape | 0.7660981 | 0.8229905 |
| 51 | spo2.skew | 0.7543462 | 0.7602181 |
| 33 | resp_26 | 0.7070544 | 0.7146722 |
| 41 | resp_rate.median | 0.6614654 | 0.6690283 |
| 35 | resp_30 | 0.5698405 | 0.4764221 |
| 49 | spo2.median | 0.4914154 | 0.4814679 |
| 48 | spo2.mean | 0.4297872 | 0.4823535 |
| 37 | resp_34 | 0.4064184 | 0.3971066 |
| 40 | resp_rate.mean | 0.3800149 | 0.3988914 |
| 53 | spo2_87 | 0.3381063 | 0.2927159 |
| 9 | diabetes | 0.3206089 | 0.3017999 |
| 19 | ed_before_order_set | 0.2435553 | 0.2151431 |
| 64 | xray_pleural_effusion | 0.2008500 | 0.2464926 |
| 21 | fever | 0.1972144 | 0.2178333 |
| 31 | nausea_vomit | 0.1456769 | 0.1560759 |
| 63 | xray_clear | 0.1354004 | 0.1772019 |
| 32 | resp_24 | 0.0755645 | 0.0693941 |
| 44 | sex | 0.0627708 | 0.0594268 |
| 30 | myalgias | 0.0572895 | 0.0428114 |
| 52 | spo2_85 | 0.0406701 | 0.0340323 |
| 28 | hypertension | 0.0352648 | 0.0514657 |
| 8 | cough | 0.0352396 | 0.0329172 |
| 10 | diarrhea | 0.0351415 | 0.0216003 |
| 65 | xray_unilateral_infiltrate | 0.0260030 | 0.0142644 |
| 29 | hypoxia | 0.0238360 | 0.0151726 |
| 62 | xray_bilateral_infiltrates | 0.0215882 | 0.0088282 |
| 20 | esrd | 0.0128579 | 0.0198408 |
| 7 | copd | 0.0105663 | 0.0088362 |
| 18 | dyspnea | 0.0101991 | 0.0093264 |
| 2 | any_immunosuppression | 0.0092275 | 0.0051901 |
| 6 | ckd | 0.0060096 | 0.0028444 |
| 4 | cad | 0.0033050 | 0.0000000 |
## The standard error of cross-validation error with gradient boosting is 0.997%
## The standard error of cross-validation error with AdaBoost is 0.900%
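The report does not state how these standard errors were computed. One common convention, shown in the Python sketch below as an assumption, is the sample standard deviation of the per-fold errors divided by the square root of the number of folds. The function name and fold errors are invented.

```python
import math
import statistics


def cv_standard_error(fold_errors):
    """Standard error of the mean cross-validation error across folds."""
    return statistics.stdev(fold_errors) / math.sqrt(len(fold_errors))


# Hypothetical misclassification rates from 5 folds.
fold_errors = [0.20, 0.22, 0.24, 0.26, 0.28]
print(f"mean error: {statistics.fmean(fold_errors):.3f}")
print(f"standard error: {cv_standard_error(fold_errors):.4f}")
```

A smaller standard error means the error estimate changes less from fold to fold, which is the sense in which the text compares the variance of the two models.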
Then, I wanted to test whether another methodology would improve upon the performance of the tree-based models. I selected another “black box” model: a support vector machine with a radial kernel. Because the radial kernel implicitly maps the data into an infinite-dimensional feature space, it allows the SVM to find separating hyperplanes for data that are not linearly separable in the original space. However, interpretability is diminished with this approach. We can still output relative feature importance, just as we did with the tree-based models.
To build this model, I first scaled the features and then selected the tuning parameters C and \(\sigma\) using a grid search with 5-fold cross-validation. Based on the cross-validation accuracy, the model with a \(\sigma\) of 0.1 suffered from overfitting and performed very poorly, consistent with expected results from higher values of this tuning parameter. Overall, the cross-validation accuracy was highest for the model fit with a \(\sigma\) of 0.001. This is also apparent when comparing the ROC curves for the training and testing data, where the test performance is best with a \(\sigma\) of 0.001 despite that model having the lowest training AUC.
The feature importance is roughly similar to that resulting from the tree-based methods. Again, the continuous variables appeared to be the most important features, with age, heart rate, and BMI as the top 3 features. Of the categorical features, diabetes, admission to the ED before the lab order set change, fever, myalgias, and smoking status were the most important.
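To see why a larger \(\sigma\) tends to overfit, consider the radial kernel itself. The Python sketch below assumes the parameterization \(k(x, z) = \exp(-\sigma \lVert x - z \rVert^2)\), which is common in R SVM implementations but is not confirmed by the report. The function name and the points are hypothetical.

```python
import math


def rbf_kernel(x, z, sigma):
    """Radial kernel: similarity decays with squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * squared_distance)


# Two hypothetical scaled feature vectors a moderate distance apart.
x, z = [0.0, 0.0, 0.0], [2.0, -1.0, 3.0]
for sigma in (0.001, 0.01, 0.1):
    print(sigma, rbf_kernel(x, z, sigma))
```

With a larger \(\sigma\) the similarity falls toward zero more quickly, so each training point influences only its immediate neighbourhood and the decision boundary can bend around individual observations.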
| | No | Yes |
|---|---|---|
| age | 100.0000000 | 100.0000000 |
| heart_rate.mean | 76.7510305 | 76.7510305 |
| heart_rate.median | 72.0219731 | 72.0219731 |
| bmi | 63.0554928 | 63.0554928 |
| heart_rate.min | 50.8030869 | 50.8030869 |
| heart_rate.max | 39.6342840 | 39.6342840 |
| diastolic.max | 36.5707645 | 36.5707645 |
| diastolic.mean | 36.3065526 | 36.3065526 |
| diastolic.median | 35.0691048 | 35.0691048 |
| systolic.median | 30.6494093 | 30.6494093 |
| duration_symptoms | 27.0817134 | 27.0817134 |
| systolic.mean | 26.9086379 | 26.9086379 |
| systolic.max | 21.6545012 | 21.6545012 |
| diabetes | 18.1687444 | 18.1687444 |
| systolic.skew | 17.4455063 | 17.4455063 |
| ed_before_order_set | 15.2306419 | 15.2306419 |
| systolic.kurtosis | 14.0308191 | 14.0308191 |
| heart_rate.skew | 13.4873454 | 13.4873454 |
| spo2_93 | 11.4564259 | 11.4564259 |
| fever | 10.9497412 | 10.9497412 |
| myalgias | 10.7499101 | 10.7499101 |
| spo2.kurtosis | 10.4254981 | 10.4254981 |
| smoke_vape | 10.2984089 | 10.2984089 |
| spo2.mean | 9.0241720 | 9.0241720 |
| xray_bilateral_infiltrates | 8.8795244 | 8.8795244 |
| spo2_90 | 8.4648122 | 8.4648122 |
| resp_rate.skew | 8.3703314 | 8.3703314 |
| cancer | 8.3661508 | 8.3661508 |
| diastolic.min | 8.1094640 | 8.1094640 |
| spo2.median | 7.5960903 | 7.5960903 |
| dyspnea | 7.0894056 | 7.0894056 |
| spo2.min | 6.4489427 | 6.4489427 |
| resp_26 | 6.2399144 | 6.2399144 |
| systolic.min | 6.2014532 | 6.2014532 |
| resp_rate.kurtosis | 5.9790470 | 5.9790470 |
| esrd | 5.8469411 | 5.8469411 |
| sex | 5.6838990 | 5.6838990 |
| spo2_87 | 5.5225291 | 5.5225291 |
| diarrhea | 5.4974457 | 5.4974457 |
| resp_30 | 5.1094891 | 5.1094891 |
| diastolic.kurtosis | 3.9723748 | 3.9723748 |
| cad | 3.8628440 | 3.8628440 |
| resp_rate.max | 3.7215408 | 3.7215408 |
| any_immunosuppression | 3.1320808 | 3.1320808 |
| hypoxia | 2.8319161 | 2.8319161 |
| resp_28 | 2.7232214 | 2.7232214 |
| xray_unilateral_infiltrate | 2.5551626 | 2.5551626 |
| resp_24 | 2.4983069 | 2.4983069 |
| resp_34 | 2.4088427 | 2.4088427 |
| ckd | 2.4071705 | 2.4071705 |
| resp_rate.median | 2.1295809 | 2.1295809 |
| cough | 2.1136947 | 2.1136947 |
| xray_pleural_effusion | 2.0008194 | 2.0008194 |
| resp_rate.mean | 1.9138636 | 1.9138636 |
| xray_clear | 1.6070100 | 1.6070100 |
| spo2.skew | 1.3402898 | 1.3402898 |
| resp_32 | 1.0484862 | 1.0484862 |
| spo2_85 | 0.9105275 | 0.9105275 |
| copd | 0.6203962 | 0.6203962 |
| diastolic.skew | 0.6162156 | 0.6162156 |
| spo2.max | 0.5978211 | 0.5978211 |
| heart_rate.kurtosis | 0.5643766 | 0.5643766 |
| resp_rate.min | 0.3921372 | 0.3921372 |
| hypertension | 0.2006672 | 0.2006672 |
| nausea_vomit | 0.0000000 | 0.0000000 |
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 138 34
## Yes 39 138
##
## Accuracy : 0.7908
## 95% CI : (0.7443, 0.8323)
## No Information Rate : 0.5072
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5817
##
## Mcnemar's Test P-Value : 0.6397
##
## Sensitivity : 0.7797
## Specificity : 0.8023
## Pos Pred Value : 0.8023
## Neg Pred Value : 0.7797
## Prevalence : 0.5072
## Detection Rate : 0.3954
## Detection Prevalence : 0.4928
## Balanced Accuracy : 0.7910
##
## 'Positive' Class : No
##
## The standard error of cross-validation error with SVM is 1.316%
Many of the features, particularly those we engineered, are correlated with one another. This multicollinearity precludes us from deploying typical generalized linear models, such as logistic regression, particularly because we are concerned with both inference and prediction.
However, a regularized regression model, such as the Lasso, could be used to generate interpretable coefficients. As with the SVM, the first step toward interpreting the relative importance of the features is to scale them. Next, I used cross-validation to determine the optimal \(\lambda\) shrinkage parameter. Then I fit the model and assessed performance using training, testing, and cross-validation error. Additionally, I included all pairwise interaction terms in the model to attempt to understand how the features influenced one another.
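The size of the resulting design matrix can be checked with a short Python sketch. It assumes 65 main-effect columns (the number of features listed in the importance tables) and pairwise products only, with no squared terms. The function names and the three-feature example are hypothetical.

```python
from itertools import combinations
from math import comb


def design_width(n_features):
    """Number of columns: main effects plus all pairwise interactions."""
    return n_features + comb(n_features, 2)


def add_interactions(row):
    """Append the product of every pair of features to one observation."""
    expanded = dict(row)
    for (name_a, a), (name_b, b) in combinations(row.items(), 2):
        expanded[f"{name_a}:{name_b}"] = a * b
    return expanded


print(design_width(65))  # 2145
print(add_interactions({"age": 0.5, "bmi": -1.0, "diabetesYes": 1.0}))
```

Under those assumptions, 65 + 2080 = 2145 columns, which agrees with the variable count in the Lasso output.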
While the other models have provided relative feature importance, I had not yet investigated how these features affect the outcome because the tree-based models are non-parametric and I had not evaluated the coefficients of the SVM. Looking at the coefficients, it becomes clear that increased age, heart rate, BMI, and systolic blood pressure are associated with decreased risk of intubation and/or death.
All of the top features associated with an increased risk of intubation or death are interaction terms. Top interaction terms associated with an increased risk of an event are having diabetes and dyspnea, having a pleural effusion on x-ray and respiratory rate above 32 breaths/min, being a smoker and having dyspnea, and being a smoker and being hypoxic.
With the other methods, multiple features generated from the same lab value (e.g., heart rate mean and median) appeared among the most important features. With the Lasso regression, this occurred less frequently because the coefficients of redundant features were shrunk to zero.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 139 46
## Yes 38 126
##
## Accuracy : 0.7593
## 95% CI : (0.7109, 0.8032)
## No Information Rate : 0.5072
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5182
##
## Mcnemar's Test P-Value : 0.445
##
## Sensitivity : 0.7853
## Specificity : 0.7326
## Pos Pred Value : 0.7514
## Neg Pred Value : 0.7683
## Prevalence : 0.5072
## Detection Rate : 0.3983
## Detection Prevalence : 0.5301
## Balanced Accuracy : 0.7589
##
## 'Positive' Class : No
##
## The standard error of cross-validation error with a Lasso regression is 1.364%
## 2052 variables out of 2145 had coefficients reduced to zero.
## 93 variables out of 2145 had nonzero coefficients.
| | features | coefficients | odds_ratio |
|---|---|---|---|
| 1 | age | -1.0350460 | 0.3552100 |
| 32 | heart_rate.mean | -0.7848897 | 0.4561700 |
| 3 | bmi | -0.7341888 | 0.4798946 |
| 388 | diabetesYes:dyspneaChecked | 0.4174314 | 1.5180573 |
| 1273 | xray_pleural_effusionChecked:resp_32 | 0.4080340 | 1.5038582 |
| 1114 | xray_clearChecked:duration_symptoms | -0.3790310 | 0.6845244 |
| 724 | cancerYes:xray_bilateral_infiltratesChecked | -0.3655983 | 0.6937814 |
| 329 | smoke_vapeYes:dyspneaChecked | 0.2746908 | 1.3161237 |
| 40 | systolic.median | -0.2383704 | 0.7879108 |
| 799 | any_immunosuppressionYes:systolic.max | -0.2222619 | 0.8007057 |
| 255 | hypoxiaYes:smoke_vapeYes | 0.2088478 | 1.2322575 |
| 795 | any_immunosuppressionYes:diastolic.max | -0.2056084 | 0.8141519 |
| 1076 | dyspneaChecked:diastolic.mean | -0.1886905 | 0.8280428 |
| 1218 | xray_bilateral_infiltratesChecked:diastolic.max | -0.1841598 | 0.8318029 |
| 151 | sexMale:duration_symptoms | -0.1810089 | 0.8344279 |
| 650 | esrdChecked:resp_34 | 0.1530776 | 1.1654155 |
| 837 | feverChecked:heart_rate.mean | -0.1511557 | 0.8597138 |
| 385 | diabetesYes:diarrheaChecked | 0.1407827 | 1.1511745 |
| 607 | esrdChecked:cancerYes | -0.1395897 | 0.8697150 |
| 914 | coughChecked:systolic.skew | 0.1285054 | 1.1371276 |
| 41 | diastolic.max | -0.1252398 | 0.8822853 |
| 1069 | dyspneaChecked:duration_symptoms | -0.1250613 | 0.8824428 |
| 1284 | xray_pleural_effusionChecked:heart_rate.kurtosis | -0.1244112 | 0.8830167 |
| 204 | bmi:coughChecked | -0.1216487 | 0.8854593 |
| 514 | copdChecked:systolic.min | -0.1154708 | 0.8909466 |
| 547 | copdChecked:spo2.kurtosis | 0.1124590 | 1.1190264 |
| 462 | hypertensionYes:systolic.mean | -0.1111172 | 0.8948339 |
| 1152 | xray_clearChecked:diastolic.kurtosis | 0.1105465 | 1.1168883 |
| 634 | esrdChecked:spo2.median | -0.1044173 | 0.9008493 |
| 745 | cancerYes:resp_rate.max | 0.0919102 | 1.0962664 |
| 391 | diabetesYes:xray_bilateral_infiltratesChecked | 0.0804692 | 1.0837955 |
| 2091 | resp_34:resp_rate.skew | 0.0803895 | 1.0837091 |
| 482 | hypertensionYes:resp_34 | -0.0793640 | 0.9237036 |
| 403 | diabetesYes:spo2.mean | 0.0792707 | 1.0824973 |
| 1117 | xray_clearChecked:heart_rate.min | -0.0787808 | 0.9242425 |
| 836 | feverChecked:diastolic.mean | -0.0774835 | 0.9254423 |
| 1337 | ed_before_order_setYes:heart_rate.median | -0.0765701 | 0.9262880 |
| 275 | hypoxiaYes:ed_before_order_setYes | -0.0749187 | 0.9278189 |
| 18 | myalgiasChecked | -0.0684763 | 0.9338156 |
| 769 | any_immunosuppressionYes:coughChecked | 0.0659893 | 1.0682152 |
| 347 | smoke_vapeYes:heart_rate.median | -0.0646944 | 0.9373539 |
| 1360 | ed_before_order_setYes:heart_rate.skew | 0.0617177 | 1.0636620 |
| 1304 | duration_symptoms:spo2.max | -0.0572311 | 0.9443758 |
| 334 | smoke_vapeYes:duration_symptoms | -0.0559055 | 0.9456285 |
| 948 | diarrheaChecked:spo2.max | 0.0527297 | 1.0541446 |
| 613 | esrdChecked:myalgiasChecked | 0.0524342 | 1.0538332 |
| 163 | sexMale:diastolic.median | -0.0498722 | 0.9513510 |
| 66 | age:sexMale | -0.0490329 | 0.9521498 |
| 460 | hypertensionYes:resp_rate.mean | -0.0490045 | 0.9521768 |
| 31 | diastolic.mean | -0.0470021 | 0.9540854 |
| 413 | diabetesYes:spo2.max | 0.0437650 | 1.0447369 |
| 1723 | diastolic.median:spo2_93 | 0.0432961 | 1.0442471 |
| 24 | duration_symptoms | -0.0410471 | 0.9597840 |
| 2145 | systolic.kurtosis:heart_rate.kurtosis | 0.0394980 | 1.0402884 |
| 949 | diarrheaChecked:systolic.max | -0.0394520 | 0.9613161 |
| 1341 | ed_before_order_setYes:diastolic.max | -0.0370975 | 0.9635822 |
| 78 | age:feverChecked | -0.0360910 | 0.9645525 |
| 1025 | myalgiasChecked:diastolic.min | -0.0360690 | 0.9645738 |
| 25 | ed_before_order_setYes | -0.0342972 | 0.9662843 |
| 458 | hypertensionYes:diastolic.mean | -0.0336352 | 0.9669241 |
| 137 | sexMale:esrdChecked | 0.0324941 | 1.0330278 |
| 1416 | heart_rate.min:spo2.median | 0.0290707 | 1.0294973 |
| 91 | age:heart_rate.min | -0.0283885 | 0.9720107 |
| 409 | diabetesYes:systolic.median | -0.0279008 | 0.9724848 |
| 1562 | diastolic.mean:resp_rate.max | -0.0270074 | 0.9733540 |
| 120 | age:resp_rate.skew | 0.0251900 | 1.0255099 |
| 396 | diabetesYes:heart_rate.min | -0.0240975 | 0.9761905 |
| 1872 | heart_rate.max:systolic.max | -0.0240638 | 0.9762235 |
| 1942 | systolic.max:resp_28 | -0.0236850 | 0.9765933 |
| 1583 | diastolic.mean:systolic.kurtosis | -0.0236219 | 0.9766549 |
| 205 | bmi:diarrheaChecked | -0.0234469 | 0.9768259 |
| 85 | age:xray_unilateral_infiltrateChecked | -0.0201605 | 0.9800413 |
| 1537 | systolic.min:resp_28 | 0.0187411 | 1.0189178 |
| 1933 | spo2.max:spo2.kurtosis | 0.0179319 | 1.0180937 |
| 1309 | duration_symptoms:spo2_93 | 0.0161090 | 1.0162394 |
| 1442 | heart_rate.min:heart_rate.kurtosis | 0.0157948 | 1.0159202 |
| 470 | hypertensionYes:resp_rate.max | -0.0155562 | 0.9845641 |
| 1086 | dyspneaChecked:diastolic.max | -0.0153870 | 0.9847308 |
| 1421 | heart_rate.min:spo2.max | 0.0148290 | 1.0149395 |
| 827 | feverChecked:xray_bilateral_infiltratesChecked | 0.0139351 | 1.0140327 |
| 397 | diabetesYes:resp_rate.min | -0.0119229 | 0.9881479 |
| 158 | sexMale:diastolic.mean | -0.0108361 | 0.9892224 |
| 1535 | systolic.min:resp_24 | 0.0100020 | 1.0100522 |
| 1544 | systolic.min:systolic.skew | -0.0083029 | 0.9917315 |
| 2086 | resp_32:resp_rate.kurtosis | 0.0073886 | 1.0074160 |
| 2088 | resp_32:spo2.kurtosis | 0.0066133 | 1.0066352 |
| 342 | smoke_vapeYes:heart_rate.mean | -0.0053829 | 0.9946316 |
| 68 | age:hypoxiaYes | -0.0053626 | 0.9946518 |
| 1972 | spo2_85:spo2.kurtosis | -0.0051419 | 0.9948713 |
| 2004 | spo2_90:heart_rate.skew | 0.0043740 | 1.0043836 |
| 449 | hypertensionYes:xray_bilateral_infiltratesChecked | 0.0038531 | 1.0038605 |
| 2114 | diastolic.skew:diastolic.kurtosis | -0.0026672 | 0.9973364 |
| 2142 | diastolic.kurtosis:heart_rate.kurtosis | -0.0014104 | 0.9985906 |
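The odds ratio column is the exponential of the coefficient column. The Python sketch below checks this for three rows of the table. The function name is hypothetical, and the reading "per one standard deviation" assumes the continuous features were standardized, as the scaling step suggests.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds of an event per unit increase."""
    return math.exp(coefficient)


# Coefficients copied from the table above.
coefficients = {
    "age": -1.0350460,
    "diabetesYes:dyspneaChecked": 0.4174314,
    "xray_pleural_effusionChecked:resp_32": 0.4080340,
}
for feature, beta in coefficients.items():
    direction = "higher" if beta > 0 else "lower"
    print(f"{feature}: OR = {odds_ratio(beta):.4f} ({direction} odds of an event)")
```

An odds ratio of about 0.355 for age means that, holding the other terms fixed, a one-unit increase in scaled age multiplies the odds of intubation or death by roughly 0.355.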
Overall, while the accuracy was higher for the SVM model, the ROC AUC was higher for the tree-based methods. Additionally, the standard error of the cross-validation error from the model training is highest for the Lasso and SVM models, followed by the gradient boosted model, and lowest for the AdaBoost model. Thus, the AdaBoost model still seems to perform the best on the whole and is the model I would recommend for performance. If inference is the goal, the Lasso regression would be the optimal model as it produces interpretable coefficients, as discussed in the Lasso section.
| | Gradient boosted | AdaBoost | SVM | Lasso regression |
|---|---|---|---|---|
| Accuracy | 77.6% | 76.4% | 79.1% | 75.9% |
| AUC | 0.895 | 0.895 | 0.863 | 0.845 |
| SE, CV Training Error | 1.00% | 0.90% | 1.32% | 1.36% |
In addition to the analyses described in the body of the report, I also investigated the usage of derived features generated by PCA and a neural network.
For PCA, the scree plot shows that the first 17 principal components are needed to explain 80% of the variance (a common “rule of thumb” for the percentage of variance that should be explained), so PCA does not enable mapping to a much lower-dimensional space, especially compared to the Lasso regression. As the plots of the first 2 and 3 principal components show, the classes are not clearly separable with the first 3 PCs.
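The 80% rule amounts to accumulating the variance explained by each component until the threshold is reached. The Python sketch below uses a hypothetical function name and made-up component variances. The 17-component figure comes from the report's scree plot, not from this example.

```python
def components_needed(variances, target=0.80):
    """Smallest number of leading components whose variance share >= target."""
    total = sum(variances)
    running = 0.0
    for k, variance in enumerate(sorted(variances, reverse=True), start=1):
        running += variance
        if running / total >= target:
            return k
    return len(variances)


# Made-up variances (eigenvalues) for a 6-component decomposition.
example = [3.0, 2.0, 1.5, 1.5, 1.0, 1.0]
print(components_needed(example))  # 4
```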
I also considered using a neural network to enhance performance. However, it did not appear that the neural network meaningfully improved performance and did not allow for inference, given that it is a true black box model.
I used the AdaBoost model to select the top 20 most important features to include in the neural net model as neural network performance declines if many extraneous features are included.
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0.775000 | 0.771875 | 0.784375 | 0.771875 | 0.771875 |
| 0.775000 | 0.781250 | 0.762500 | 0.768750 | 0.765625 |
| 0.740625 | 0.771875 | 0.775000 | 0.750000 | 0.731250 |
| 0.750000 | 0.740625 | 0.731250 | 0.740625 | 0.771875 |
| 0.728125 | 0.750000 | 0.771875 | 0.778125 | 0.750000 |
1. Hur, K., et al. Factors Associated With Intubation and Prolonged Intubation in Hospitalized Patients With COVID-19. Otolaryngol Head Neck Surg 163, 170-178 (2020).
2. Clinical management of severe acute respiratory infection (SARI) when COVID-19 disease is suspected. (World Health Organization, 2020).